A CONSISTENCY LEMMA IN STATISTICAL PHYLOGENETICS


MIKE STEEL 


Abstract. This short note provides a simple formal proof of a folklore result in statistical 
phylogenetics concerning the convergence of bootstrap support for a tree and its edges. 


1. Definitions and preliminaries 

In this note $T$ will refer to any rooted or unrooted phylogenetic tree, and $T^{-\rho}$ will refer to the
unrooted tree obtained from $T$ by suppressing the root vertex $\rho$ if it has one (i.e. if $T$ is unrooted
then $T^{-\rho} = T$). Let $\theta$ be a vector of continuous parameters, including the branch lengths
of $T$, along with possibly other continuous parameters required to specify a model of character
evolution on $T$. Let $\Theta$ denote the set of values $\theta$ may take. Branch lengths, in particular,
are assumed to be strictly positive and finite; and in general $\Theta$ will be some open subset of
Euclidean space. Consider any stochastic process (e.g. Markov process, or mixture of Markov
processes) which assigns to each pair $(T, \theta)$ a probability distribution $s = s(T, \theta)$ on discrete,
finite-state characters at the tips of the tree. We assume throughout that the map $\theta \mapsto s(T, \theta)$ is
continuous. Such models are central to statistical phylogenetics and methods for reconstructing
phylogenetic trees from aligned genetic (e.g. DNA) sequences. A tree reconstruction method $\psi$ is
any method that reconstructs a set of one or more unrooted phylogenetic trees from any given
distribution $f$ of site pattern frequencies. Suppose we generate $k$ sites i.i.d. from $(T, \theta)$, and let $\hat{s}$
be the random variable equal to the resulting proportion of site patterns (character types). The
method $\psi$ is a statistically consistent estimator of the unrooted topology of $T$ if the probability
that $\psi(\hat{s}) = \{T^{-\rho}\}$ converges to 1 as $k \to \infty$.$^1$ Suppose that $\psi$ satisfies the following condition:

(*) For every tree $T$ for which $T^{-\rho}$ is fully-resolved (i.e. binary), and each $\theta \in \Theta(T)$, a
value $\epsilon = \epsilon_{(T,\theta)} > 0$ exists for which the following implication holds for every probability
distribution $f$ on site patterns: $\|f - s(T, \theta)\| < \epsilon \implies \psi(f) = \{T^{-\rho}\}$.

Here $\|\cdot\|$ denotes any of the usual norms in Euclidean space. Condition (*) implies the
statistical consistency of $\psi$ for inferring $T^{-\rho}$ since the i.i.d. assumption ensures that $\hat{s}$ converges in
probability to $s(T, \theta)$ as $k$ grows, and so:

$\mathbb{P}(\psi(\hat{s}) = \{T^{-\rho}\}) \geq \mathbb{P}(\|\hat{s} - s(T, \theta)\| < \epsilon_{(T,\theta)}) \to 1$, as $k \to \infty$.
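
The convergence of the empirical site-pattern proportions to the generating distribution, which
drives the bound above, can be illustrated numerically. The following is a minimal Python sketch,
assuming a hypothetical distribution on three site-pattern classes; the labels, probabilities and
function names are introduced here for illustration and do not come from any model in this note.

```python
import random
from collections import Counter

def empirical_frequencies(s, k, rng):
    """Draw k site patterns i.i.d. from s and return their observed proportions."""
    patterns = list(s)
    weights = [s[p] for p in patterns]
    counts = Counter(rng.choices(patterns, weights=weights, k=k))
    return {p: counts[p] / k for p in patterns}

def l1_distance(f, g):
    """L1 distance between two distributions defined on the same pattern set."""
    return sum(abs(f[p] - g[p]) for p in f)

# Hypothetical generating distribution (illustrative values only).
s = {"12|34": 0.5, "13|24": 0.3, "14|23": 0.2}
rng = random.Random(0)  # fixed seed so the run is reproducible

# Distance between the empirical proportions and s for increasing k.
distances = {
    k: l1_distance(empirical_frequencies(s, k, rng), s)
    for k in (100, 10_000, 100_000)
}
```

For large $k$ the entries of `distances` are typically small, so the empirical proportions
eventually fall inside any fixed ball around the generating distribution, which is all that
condition (*) requires.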

Not only does condition (*) imply that $\psi(s(T, \theta)) = \{T^{-\rho}\}$ whenever $T^{-\rho}$ is fully-resolved but
(*) also implies the stronger condition that for any tree $T'$ that has a different unrooted topology
(fully-resolved or non-fully-resolved) from the fully-resolved tree $T$ we have:

(1) $\inf_{\theta' \in \Theta(T')} \|s(T, \theta) - s(T', \theta')\| > 0$,

a strong 'identifiability' condition, referred to as 'no touching' in [J.

Condition (*) is a type of local stability condition. It applies, for example, to distance-based
tree reconstruction applied to (statistically consistent) 'corrected distances' derived from the
characters, provided that the distance-reconstruction method has a positive 'safety radius', which
holds for many (but not all) distance-based methods, including the popular Neighbor-Joining
method [T.. Condition (*) also applies to MLE (maximum likelihood estimation) for models
which satisfy (1); such models include the general time-reversible (GTR) Markov processes and
their submodels (e.g. Jukes-Cantor type models) and certain extensions of these models. Here
MLE treats $\theta$ as 'nuisance parameters' to be optimized as part of the search for the MLE tree;
given a vector $f$ as input, MLE selects the tree(s) $T'$ maximizing $\sup_{\theta \in \Theta(T')} \mathbb{P}(f \mid s(T', \theta))$. The
proof that Condition (*) holds for models satisfying (1) follows from standard analytic arguments
based on the continuity of the map $\theta \mapsto \mathbb{P}(f \mid s(T', \theta))$ (see e.g. [2] or [3]).

1 There is a slightly stronger definition involving almost sure convergence rather than convergence in probability,
and the results here can be extended to that setting also.

2. Result 

Given $\hat{s}$ derived from $k$ i.i.d. site patterns, let $\hat{s}^*$ denote the frequency of site patterns obtained
by taking an i.i.d. sample of $k$ site patterns using probability distribution $\hat{s}$. Thus $\hat{s}^*$ is the
distribution of site patterns in a bootstrap sample from the original data. The bootstrap support
of an edge $e$ of an unrooted phylogenetic tree $T'$ is the expected proportion of such bootstrap
samples for which a tree, sampled uniformly at random from $\psi(\hat{s}^*)$, has an edge that induces
the same split of the leaf taxa as $e$ does in $T'$ (it is a random variable by its dependence on $\hat{s}$,
and since $\psi$ can return more than one tree). The bootstrap support for $T'$ is the random variable
$\mathbb{P}(\psi(\hat{s}^*) = \{T'\} \mid \hat{s})$, the expected proportion of bootstrap samples for which $\psi$ returns the single
tree $T'$. The following result was motivated by a question from T. Warnow (pers. comm.).

Lemma 1. Suppose $k$ sites are generated i.i.d. by $s(T, \theta)$. Under the sufficient condition (*) for
statistical consistency, the bootstrap support of every edge $e$ of $T^{-\rho}$ converges in probability to 1
as $k \to \infty$. Moreover, the bootstrap support for $T^{-\rho}$ converges in probability to 1 as $k \to \infty$.

Proof. Clearly it suffices to prove the second assertion in the lemma, since, by definition, the
bootstrap support for any edge $e$ of $T^{-\rho}$ is at least $\mathbb{P}(\psi(\hat{s}^*) = \{T^{-\rho}\} \mid \hat{s})$. Let $X = X(\hat{s})$ be the 0/1
random variable which takes the value 1 precisely if $\psi(\hat{s}^*) = \{T^{-\rho}\}$, and which is 0 otherwise.
Let $Y$ denote the expected bootstrap support for $T^{-\rho}$ given $\hat{s}$; thus $Y = \mathbb{P}(\psi(\hat{s}^*) = \{T^{-\rho}\} \mid \hat{s}) =
\mathbb{E}[X \mid \hat{s}]$ (i.e. the conditional expectation of $X$ given $\hat{s}$). Notice that:

(2) $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[X \mid \hat{s}]] = \mathbb{E}[X] = \mathbb{P}(\psi(\hat{s}^*) = \{T^{-\rho}\})$.

Now, as $k$ grows, $\hat{s} \xrightarrow{P} s$, and $\hat{s}^* - \hat{s} \xrightarrow{P} 0$; thus $\hat{s}^* \xrightarrow{P} s$. Consequently, by Condition (*),
$\mathbb{P}(\psi(\hat{s}^*) = \{T^{-\rho}\})$ converges to 1 as $k \to \infty$, and so, by (2), $\lim_{k \to \infty} \mathbb{E}[Y] = 1$. Finally, since
$Y$ takes values in the interval $[0,1]$, and the expected value of $Y$ converges to 1 as $k \to \infty$, it
follows that (for the bootstrap support for $T^{-\rho}$) we have $Y \xrightarrow{P} 1$ as $k \to \infty$, as required. □
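
The final step of the proof can be spelled out with Markov's inequality applied to the nonnegative
random variable $1 - Y$. The threshold $\delta$ below is a symbol introduced here for illustration; it
does not appear in the original argument.

```latex
\[
\mathbb{P}\bigl(|Y - 1| \geq \delta\bigr)
  = \mathbb{P}\bigl(1 - Y \geq \delta\bigr)
  \leq \frac{\mathbb{E}[1 - Y]}{\delta}
  = \frac{1 - \mathbb{E}[Y]}{\delta}
  \longrightarrow 0
  \quad \text{as } k \to \infty, \text{ for each fixed } \delta > 0.
\]
```

The first equality uses $Y \leq 1$, and the limit uses the convergence of the expected value of $Y$
to 1 established via (2).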

Note that the empirical bootstrap support for an edge (or for a tree) given $\hat{s}$ converges
in probability to the (expected) bootstrap support value defined here, as the number $N$ of
independent bootstrap replicates becomes large; hence our results are also relevant for empirical
bootstrap support for large $N$.
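
The empirical bootstrap support for a tree can be sketched in a few lines of Python. This is an
illustrative toy under stated assumptions, not a phylogenetic implementation: `toy_psi` is a
hypothetical reconstruction method that returns the site-pattern class (standing in for a
quartet topology) of maximal frequency, and the generating distribution is invented for the
example. Such a method satisfies a condition of type (*) whenever the generating distribution has
a unique maximum.

```python
import random
from collections import Counter

def resample(f, k, rng):
    """Draw k site patterns i.i.d. from the distribution f; return proportions."""
    patterns = list(f)
    counts = Counter(rng.choices(patterns, weights=[f[p] for p in patterns], k=k))
    return {p: counts[p] / k for p in patterns}

def toy_psi(f):
    """Hypothetical method: the set of pattern classes of maximal frequency.

    Ties yield more than one 'tree', mirroring a set-valued method."""
    best = max(f.values())
    return {p for p in f if f[p] == best}

def empirical_bootstrap_support(s_hat, k, target, n_replicates, rng):
    """Proportion of N bootstrap replicates on which toy_psi returns exactly {target}."""
    hits = sum(
        toy_psi(resample(s_hat, k, rng)) == {target} for _ in range(n_replicates)
    )
    return hits / n_replicates

# Hypothetical generating distribution (illustrative values only).
s = {"12|34": 0.5, "13|24": 0.3, "14|23": 0.2}
rng = random.Random(1)  # fixed seed so the run is reproducible

support = {}
for k in (20, 200, 2000):
    s_hat = resample(s, k, rng)  # the original data: k sites drawn from s
    support[k] = empirical_bootstrap_support(s_hat, k, "12|34", 200, rng)
```

With these illustrative values the support for the generating class is expected to approach 1 as
$k$ grows, in line with Lemma 1; the number of replicates (200 here) only controls how closely the
empirical proportion tracks the expected bootstrap support.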